It is shown that a Partitioned Optical Passive Stars (POPS) network with $g$ groups and $d$ processors per group can
route any permutation among the $n = dg$ processors in one slot when $d = 1$ and $2\lceil d/g \rceil$ slots when $d > 1$. The
number of slots used is optimal in the worst case, and is at most twice the optimum for all permutations
$\pi$ such that $\pi(i) \ne i$, for all $i$.
1. INTRODUCTION

The Partitioned Optical Passive Star (POPS) network [Chiarulli et al. 1994; Gravenstreter
et al. 1995; Gravenstreter and Melhem 1998; Melhem et al. 1998] is a SIMD interconnection
network that uses multiple optical passive star (OPS) couplers. A $d \times d$ OPS coupler
(see Figure 1) is an all-optical passive device which is capable of receiving an optical signal
from one of its $d$ sources and broadcasting it to all of its $d$ destinations. Being a passive
all-optical technology, it benefits from a number of characteristics such as no opto-electronic
conversion, high noise immunity, and low latency.

The number of processors of the network is denoted by $n$, and each processor has a
distinct index in $\{0, \ldots, n-1\}$. The $n$ processors are partitioned into $g = n/d$ groups
in such a way that processor $i$ belongs to group $\mathrm{group}(i) := \lfloor i/d \rfloor$. It is assumed that
$d$ divides $n$; consequently, each group consists of $d$ processors. For each pair of groups
$a, b \in \{0, \ldots, g-1\}$, a coupler $c(b, a)$ is introduced which has all the processors of group $a$
as sources and all the processors of group $b$ as destinations. The number of couplers used
is $g^2$. Such an architecture will be denoted by POPS$(d, g)$ (see Figure 2).

Fig. 1. A $4 \times 4$ Optical Passive Star (OPS) coupler.

Fig. 2. A POPS$(3, 2)$.

For all $i \in \{0, \ldots, n-1\}$, processor $i$ has $g$ transmitters which are connected to couplers
$c(a, \mathrm{group}(i))$, $a = 0, \ldots, g-1$. Similarly, processor $i$ has $g$ receivers connected to couplers
$c(\mathrm{group}(i), b)$, $b = 0, \ldots, g-1$. During a step of computation, each processor in parallel:

— Performs some local computations; 

— sends a packet to a subset of its transmitters; 

— receives a packet from one of its receivers. 

In order to avoid conflicts, no two processors may send a packet to the same coupler
during the same step. The time needed to perform such a step is referred to as a slot.
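
The model just described can be made concrete with a minimal Python sketch. The helper names (`group`, `coupler`, `slot_is_conflict_free`) and the representation of a slot as a list of (source, destination) processor pairs are illustrative assumptions, not notation from the text; the sketch only covers direct one-packet transfers.

```python
def group(i, d):
    """Group of processor i when every group holds d processors."""
    return i // d

def coupler(src, dst, d):
    """Coupler c(b, a) used by a direct transfer: a = source group, b = destination group."""
    return (group(dst, d), group(src, d))

def slot_is_conflict_free(transfers, d):
    """True if no coupler is used twice and no processor receives two packets."""
    used = [coupler(s, t, d) for s, t in transfers]
    receivers = [t for _, t in transfers]
    return len(set(used)) == len(used) and len(set(receivers)) == len(receivers)

# POPS(3, 2): processors 0..2 form group 0, processors 3..5 form group 1.
print(coupler(4, 1, 3))                             # (0, 1), i.e. coupler c(0, 1)
print(slot_is_conflict_free([(0, 3), (4, 1)], 3))   # True: c(1, 0) and c(0, 1)
print(slot_is_conflict_free([(3, 0), (4, 1)], 3))   # False: both transfers need c(0, 1)
```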

One of the advantages of a POPS$(d, g)$ network is that its diameter is 1. A packet can be
sent from processor $i$ to processor $j$, $i \ne j$, in one slot by using coupler $c(\mathrm{group}(j), \mathrm{group}(i))$.
However, its bandwidth varies according to $g$. In a POPS$(n, 1)$ network, only one packet
can be sent through the single coupler per slot. At the other extreme, a POPS$(1, n)$ network
is a highly expensive, fully interconnected optical network using $n^2$ OPS couplers.

A one-to-all communication pattern can also be performed in only one slot in the following
way: processor $i$ (the speaker) sends the packet to all the couplers $c(a, \mathrm{group}(i))$,
$a \in \{0, \ldots, g-1\}$; during the same slot, all the processors $j$, $j \in \{0, \ldots, n-1\}$, can receive
the packet through coupler $c(\mathrm{group}(j), \mathrm{group}(i))$.
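
As a small illustration of the one-to-all pattern, the sketch below lists the couplers driven by the speaker and checks that every processor listens on one of them. The function names are hypothetical and couplers are written as pairs (b, a) standing for c(b, a).

```python
def broadcast_couplers(speaker, d, g):
    """Couplers c(a, group(speaker)), a = 0..g-1, driven by the speaker in one slot."""
    src = speaker // d
    return [(a, src) for a in range(g)]

def listening_coupler(j, speaker, d):
    """Coupler c(group(j), group(speaker)) from which processor j reads the packet."""
    return (j // d, speaker // d)

d, g = 3, 2
driven = broadcast_couplers(4, d, g)
print(driven)                                                            # [(0, 1), (1, 1)]
print(all(listening_coupler(j, 4, d) in driven for j in range(d * g)))   # True
```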

The POPS network model has been used to develop a number of nontrivial algorithms.
Several common communication patterns are realized in [Gravenstreter and Melhem 1998].
Simulation algorithms for the mesh and hypercube interconnection networks
can be found in [Sahni 2000b]. Algorithms for data sum, prefix sum, consecutive sum,
adjacent sum, and several data movement operations are also described in [Sahni 2000b].
An algorithm for matrix multiplication is provided in [Sahni 2000a]. These algorithms are
based on sophisticated communication patterns, which have been investigated one by one,
and shown to be routable on a POPS$(d, g)$ network. However, most of these patterns belong
to a more general class of permutation routing problems whose routability on the POPS
network was not known in general. In this paper, we show that a POPS$(d, g)$ network can
efficiently route $n = dg$ packets arranged in the $n$ processors according to any permutation,
generalizing and unifying several known results that have appeared in the recent literature.
2. DEFINITION OF THE PROBLEM AND RELATED WORK

Let $N_n := \{0, 1, \ldots, n-1\}$ denote the set of the first $n$ natural numbers, and let $\pi$ be a
permutation of the set $N_n$. A permutation routing problem consists of a set of $n$ packets
$p_0, \ldots, p_{n-1}$. Packet $p_i$ is stored in the local memory of processor $i$, for all $i \in N_n$, and has
a desired destination $\pi(i)$. The problem is to route the packets to their destinations in as
few slots as possible.

No general solution has been given for this problem on the POPS network. Efficient
routings are known for a few particular permutations, which have been independently
attacked, and most of them require one slot when $d = 1$ and $2\lceil d/g \rceil$ slots when $d > 1$.
A few examples follow.



In [Gravenstreter and Melhem 1998], a characterization is given of the permutation routing
problems that can be routed in a single slot. However, only a very restricted number of
permutations fall in this class. Indeed, if two packets originating at the same group are to
be routed to the same destination group, then one slot is obviously not enough to route all
the packets.



In [Sahni 2000b], several permutation routing problems are considered in the context of
the simulation of hypercube and mesh-connected computers on the POPS network. Assume
that processor $i$ of an $n = 2^D$ processor SIMD hypercube is mapped onto processor $i$
of a POPS$(d, g)$ network, $dg = n$. For every fixed $b$, $0 \le b < D$, a primitive communication
pattern is defined such that processor $i$ sends a packet to processor $i^{(b)}$, where $i^{(b)}$ is the
number whose binary representation differs from that of $i$ only in bit $b$. Each of the $D$
communication patterns defined is a permutation routing problem. Theorem 1 of [Sahni
2000b] shows that all of them can be routed in one slot when $d = 1$ and $2\lceil d/g \rceil$ slots when
$d > 1$.
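
The hypercube patterns are easy to write down explicitly: flipping bit $b$ is an exclusive-or with $2^b$. The sketch below (the function name is an assumption for illustration) builds one such pattern and confirms that it is a permutation of the processor indices.

```python
def hypercube_pattern(D, b):
    """Destination of every processor i of a 2**D hypercube along dimension b (flip bit b)."""
    return [i ^ (1 << b) for i in range(2 ** D)]

pattern = hypercube_pattern(4, 2)
print(pattern)   # [4, 5, 6, 7, 0, 1, 2, 3, 12, 13, 14, 15, 8, 9, 10, 11]
print(sorted(pattern) == list(range(16)))   # True: the pattern is a permutation
```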

The same result has been obtained when considering the problem of simulating an $N \times N$
SIMD mesh with wraparound, where data can be moved one processor up/down along
the columns of the mesh, or right/left along the rows of the mesh. Again, assuming that
processor $(i, j)$ of the mesh is mapped onto processor $i + jN$ of a POPS$(d, g)$ network
($dg = N^2$ and either $d$ or $g$ divides $N$), Theorem 2 of [Sahni 2000b] shows that one slot
when $d = 1$ and $2\lceil d/g \rceil$ slots when $d > 1$ are enough to route each of the four permutation
routing problems.

The routability of other specific permutation routing problems is investigated in [Sahni
2000a]. For example, a vector reversal (a permutation routing problem where $\pi(i) =
n - 1 - i$, $0 \le i < n$) is shown to be routable in one slot when $d = 1$ and $2\lceil d/g \rceil$ slots when
$d > 1$ on a POPS$(d, g)$ network, $dg = n$, which is optimal when $g$ is even. To route a matrix
transpose, conversely, $\lceil d/g \rceil$ is the optimal number of slots required.

Moreover, [Sahni 2000a] considers BPC permutations. A BPC permutation is a rearrangement
of the bits of the source processor index, while some or all of the bits can be
complemented. Formally, assume that $n$ is a power of 2, $n = 2^k$, and that the binary
representation of $i$ is $[i_{k-1} i_{k-2} \cdots i_0]_2$; the set of BPC permutations is the smallest set BPC closed
under composition such that:

(1) $\pi(i) = [i_{\sigma(k-1)} i_{\sigma(k-2)} \cdots i_{\sigma(0)}]_2 \in$ BPC, for every permutation $\sigma$ of $N_k$;

(2) $\pi(i) = [i_{k-1} \cdots \bar{i_j} \cdots i_0]_2 \in$ BPC, for all $j$.

Again, [Sahni 2000a] describes how BPC permutations can be routed in one slot when
$d = 1$ and $2\lceil d/g \rceil$ slots when $d > 1$ on a POPS$(d, g)$ network, $dg = n$.
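
To make the definition of BPC permutations concrete, here is a minimal sketch that applies a bit rearrangement followed by optional complementation. The function `bpc` and its parameters `sigma` (a list with `sigma[j]` giving the source bit copied to position `j`) and `complement` (the set of output positions to flip) are assumptions introduced for illustration only.

```python
def bpc(i, sigma, complement, k):
    """Apply a BPC permutation to the k-bit index i.

    Bit j of the result is bit sigma[j] of i, flipped when j is in `complement`.
    """
    out = 0
    for j in range(k):
        bit = (i >> sigma[j]) & 1
        if j in complement:
            bit ^= 1
        out |= bit << j
    return out

# Vector reversal on n = 8: keep the bit order, complement every bit.
print([bpc(i, [0, 1, 2], {0, 1, 2}, 3) for i in range(8)])   # [7, 6, 5, 4, 3, 2, 1, 0]

# Transpose of a 4 x 4 matrix stored row by row: swap the two halves of the index.
print(bpc(1, [2, 3, 0, 1], set(), 4))                        # 4: entry (0, 1) moves to (1, 0)
```

Under this definition, vector reversal and matrix transpose both arise as special cases, which is why they appear among the examples above.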
Fig. 3. Getting to a fair distribution on a POPS$(3, 3)$. Packets are drawn as circles next to their sources on the
left. Inside each packet its destination $x_y$ can be found, where $y$ is the index of the destination processor, and $x$ is
its group. On the right, the intermediate destination of the packet as described by Section 3.1.



In this paper we unify, generalize, and simplify the previously known results, by showing
that a POPS$(d, g)$ network, $dg = n$, can route any permutation in one slot when $d = 1$
and $2\lceil d/g \rceil$ slots when $d > 1$. This gives evidence of the versatility of the network. For
example, a consequence of our Theorem 2 is that the simulation results for hypercube and
mesh-connected computers shown in [Sahni 2000b] do not depend on how the processors
of the simulated architecture are mapped onto the processors of the POPS network,
provided that it is a one-to-one mapping, which is somewhat surprising.



3. ROUTING PERMUTATIONS IN THE POPS NETWORK 

Consider the permutation routing problem defined by $\pi$ on a POPS$(d, g)$ network, $dg = n$,
where $\pi$ is a permutation of $N_n$. Our goal is to prove that $\pi$ can be routed in one slot when
$d = 1$ and $2\lceil d/g \rceil$ slots when $d > 1$.

For ease of explanation, we start from the case $d = g = \sqrt{n}$. In this case, for most
permutations one slot is not enough to route all the packets to their destinations. Take, as an
example, the permutation shown in Figure 3. Packets starting from processor 4 and
processor 5, both belonging to group 1, have the same desired destination group. If only
one slot is allowed, there is an unavoidable conflict on coupler $c(0, 1)$. Hence, two slots
are necessary to route $\pi$.

It is not hard to find a sufficient condition for a set of packets to be routable in one slot.
We will say that $m$ packets, each with a different destination, are arranged according to a
fair distribution in a POPS$(d, g)$ network if no two packets are stored in the same processor,
and no two packets with the same destination group are stored in the same group. In this
case, we will also say that the packets are fairly distributed.

It is straightforward to see that a fairly distributed set of packets is routable in one slot. 
Indeed, no conflict occurs on any coupler. 

FACT 1. In a POPS$(d, g)$ network, a fairly distributed set of $m$ packets can be routed to
destination in one slot.
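
The definition of a fairly distributed set of packets translates directly into a check. In the sketch below a packet is an assumed pair (current processor, destination processor); the function name and the sample packets are hypothetical and are not taken from Figure 3.

```python
def is_fairly_distributed(packets, d):
    """packets: list of (current_processor, destination_processor) pairs."""
    holders = [p for p, _ in packets]
    dests = [t for _, t in packets]
    group_pairs = [(p // d, t // d) for p, t in packets]
    return (len(set(dests)) == len(dests)                      # distinct destinations
            and len(set(holders)) == len(holders)              # one packet per processor
            and len(set(group_pairs)) == len(group_pairs))     # no shared (group, destination group)

# POPS(3, 3): processors 4 and 5 are both in group 1.
print(is_fairly_distributed([(4, 0), (5, 1)], 3))   # False: both packets need coupler c(0, 1)
print(is_fairly_distributed([(4, 0), (7, 1)], 3))   # True: couplers c(0, 1) and c(0, 2)
```

The third condition is exactly the statement that each coupler $c(b, a)$ carries at most one packet, which is why such a set can be routed in one slot.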

When $d = g = \sqrt{n}$, only a very small number of permutations can be routed in one slot.
However, we will show that all of them can be routed in two slots. The idea is that one
slot is always enough to move a set of $n$ packets arranged according to $\pi$ so that they
become fairly distributed. Then, a second slot routes all the packets to their destinations by
Fact 1.



Next, in Subsection 3.1, we formalize the above intuition and prove our claim,
properly generalized in order to deal with any value of $d$ and $g$. Note that, for a set of
packets to be fairly distributed, we do not really need to care about their destination
processors. All we need to know is the destination group of each packet. Thus, in
Subsection 3.1 we can reduce our discussion to source groups and destination groups: $d$
packets originate at each source group, and $d$ packets have a specific destination group.

3.1 Permutation Routing: Getting to a Fair Distribution

A list system is a triple $(S, T, L)$, where $S$ is a set of $n_1 := |S|$ source nodes, $T$ is a set of
$n_2 := |T|$ target nodes, and $L : S \times N_{\Delta_1} \to S$ assigns a list $L_s$ of $\Delta_1 \le n_2$ not necessarily
distinct elements from $S$ to every source node $s \in S$. We also let $l(s, s')$ specify how many
times the element $s' \in S$ appears in list $L_s$. A list system is called proper when $n_2$ divides
$n_1 \Delta_1$, and $\sum_{s \in S} l(s, s') = \Delta_1$ for every $s' \in S$.

Let $\Delta_2 := n_1 \Delta_1 / n_2$. A fair distribution is an assignment $f : S \times N_{\Delta_1} \to T$ such that

$|\{ f(s, i) \mid i \in N_{\Delta_1} \}| = \Delta_1$ for every $s \in S$;   (1)

$|\{ (s, i) \in S \times N_{\Delta_1} \mid f(s, i) = t \}| = \Delta_2$ for every $t \in T$;   (2)

if $(s_1, i_1) \ne (s_2, i_2)$ and $L(s_1, i_1) = L(s_2, i_2)$, then $f(s_1, i_1) \ne f(s_2, i_2)$,
for every $s_1, s_2 \in S$ and every $i_1, i_2 \in N_{\Delta_1}$.   (3)

THEOREM 1. Every proper list system admits a fair distribution. 

PROOF. Let $S' := \{s' \mid s \in S\}$. Consider the bipartite multigraph $G = (S, S'; E)$, on node
classes $S$ and $S'$, and having precisely $l(s, s')$ edges with one endnode in $s$ and the other in
$s'$. Clearly, for every $s \in S$, $E$ contains precisely $\Delta_1$ edges incident with $s$, namely the edges
$\{s, L(s, i)'\}$ for $i \in N_{\Delta_1}$. Moreover, for every $s' \in S'$, $E$ contains precisely $\Delta_1$ edges incident
with $s'$, since the list system is proper. Our problem is to find an edge-coloring
of $G$ with $n_2$ colors (where $n_2 \ge \Delta_1$ and $n_2$ divides $n_1 \Delta_1$) such that each color class has
size precisely $\Delta_2 := n_1 \Delta_1 / n_2$.

Let $V$ be a set of $n_1 - \Delta_2$ new nodes and $V' := \{v' \mid v \in V\}$. Let $H_1 = (V, S'; F_1)$ be any
bipartite $(n_2, n_2 - \Delta_1)$-regular graph on node classes $V$ and $S'$. Let $H_2 = (V', S; F_2)$
be any bipartite $(n_2, n_2 - \Delta_1)$-regular graph on node classes $V'$ and $S$. Consider the
bipartite $n_2$-regular multigraph $\hat{G} = (S \cup V, S' \cup V'; E \cup F_1 \cup F_2)$. By König's theorem [König
1916b; König 1916a], we can edge-color $\hat{G}$ with $n_2$ colors, that is, we can decompose
$E \cup F_1 \cup F_2$ into $n_2$ perfect matchings $M_1, \ldots, M_{n_2}$ of $\hat{G}$. We propose $M_1 \setminus F_1 \setminus F_2, \ldots, M_{n_2} \setminus
F_1 \setminus F_2$ as the required edge-coloring of $G$. Indeed, $M_1 \setminus F_1 \setminus F_2, \ldots, M_{n_2} \setminus F_1 \setminus F_2$ is a
decomposition of $E$ into $n_2$ matchings of $G$ and $|M_i \setminus F_1 \setminus F_2| = |M_i| - (|V| + |V'|) = (n_1 +
|V|) - 2|V| = n_1 - |V| = n_1 - n_1 + \Delta_2 = \Delta_2$, for every $i = 1, \ldots, n_2$. □



REMARK 1. The above proof is algorithmic. The computational bottleneck is in computing
a 1-factorization of a bipartite $n_2$-regular multigraph on $n := 4n_1 - 2\Delta_2$ nodes
and with $m := n n_2$ edges. This can be done in $O(n_2 m)$ time as in [Schrijver 1999], or in
$O(m \log n_2 + \frac{m}{n_2} \log \frac{m}{n_2} \log n_2)$ time as in [Kapoor and Rizzi 2000] by virtue of the algorithm
described in [Rizzi 2001].
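
The construction in the proof of Theorem 1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not one of the cited algorithms: sources are numbered $0, \ldots, n_1 - 1$, the padding graphs $H_1$ and $H_2$ are built by a round-robin rule chosen here for convenience, and the 1-factorization is obtained by repeatedly extracting a perfect matching with simple augmenting paths, which is slower than the bounds quoted in Remark 1. The names `fair_distribution` and `satisfies_conditions` are hypothetical.

```python
def fair_distribution(lists, n2):
    """lists[s][i] = L(s, i) for a proper list system; returns f[s][i] in range(n2)."""
    n1, d1 = len(lists), len(lists[0])
    assert d1 <= n2 and (n1 * d1) % n2 == 0, "list system is not proper"
    pad = n1 - n1 * d1 // n2            # |V| = |V'| = n1 - Delta_2
    size = n1 + pad                     # nodes on each side of the padded multigraph
    # adj[u] holds (right node, tag); tag is (s, i) for edges of E, None for padding.
    adj = [[] for _ in range(size)]
    for s, row in enumerate(lists):
        for i, target in enumerate(row):
            adj[s].append((target, (s, i)))          # E:  s -- L(s, i)'
    for k in range(pad * n2):
        adj[n1 + k // n2].append((k % n1, None))     # F1: v -- s', round-robin
        adj[k % n1].append((n1 + k // n2, None))     # F2: s -- v', round-robin
    f = [[None] * d1 for _ in range(n1)]
    for color in range(n2):
        match = [None] * size           # right node -> (left node, index in adj[left])

        def augment(u, seen):
            for idx, (w, _) in enumerate(adj[u]):
                if w in seen:
                    continue
                seen.add(w)
                if match[w] is None or augment(match[w][0], seen):
                    match[w] = (u, idx)
                    return True
            return False

        for u in range(size):
            if not augment(u, set()):
                raise ValueError("no perfect matching: the list system is not proper")
        for u, idx in match:            # each left node owns exactly one matched edge
            _, tag = adj[u].pop(idx)
            if tag is not None:
                f[tag[0]][tag[1]] = color
    return f

def satisfies_conditions(lists, n2, f):
    """Check conditions (1), (2) and (3) of a fair distribution."""
    n1, d1 = len(lists), len(lists[0])
    d2 = n1 * d1 // n2
    cells = [(s, i) for s in range(n1) for i in range(d1)]
    c1 = all(len(set(f[s])) == d1 for s in range(n1))
    c2 = all(sum(f[s][i] == t for s, i in cells) == d2 for t in range(n2))
    c3 = len({(lists[s][i], f[s][i]) for s, i in cells}) == len(cells)
    return c1 and c2 and c3

# Destination groups of the permutation [2, 3, 0, 5, 1, 4] on a POPS(2, 3).
lists = [[1, 1], [0, 2], [0, 2]]
print(satisfies_conditions(lists, 3, fair_distribution(lists, 3)))   # True
```

Removing a perfect matching from a $k$-regular bipartite multigraph leaves a $(k-1)$-regular one, so the loop over colors always finds the next matching when the input is proper.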



3.2 Permutation Routing: the Main Theorem 

The following theorem describes our main result. Note that the routing found by Theorem
2 has the property that at each step of computation each processor stores exactly one packet.

THEOREM 2. A POPS$(d, g)$ network can route any permutation $\pi$ among the $n = dg$
processors using one slot when $d = 1$ and $2\lceil d/g \rceil$ slots when $d > 1$.

PROOF. When $d = 1$, a POPS$(1, n)$ network is equivalent to an $n$-processor clique: the
network is fully interconnected, and the claim of the theorem is thus trivial.

Now, consider the case when $1 < d \le g$. We will show that $\pi$ can be routed in $2\lceil d/g \rceil = 2$
slots. Take the list system $(N_g, N_g, L)$, where $L : N_g \times N_d \to N_g$ is such that $L(h, i) =
\mathrm{group}(\pi(i + hd))$, $h \in N_g$, $i \in N_d$. The list system is proper, since $\pi$ is a permutation, and $g$
clearly divides $gd$. By Theorem 1, $(N_g, N_g, L)$ admits a fair distribution $f : N_g \times N_d \to N_g$.
Consequently, $f$ maps every pair $(h, i)$ to an integer from $N_g$ in such a way that:

$|\{ f(h, i) \mid i \in N_d \}| = d$ for every $h \in N_g$;   (4)

$|\{ (h, i) \in N_g \times N_d \mid f(h, i) = j \}| = d$ for every $j \in N_g$;   (5)

if $(h_1, i_1) \ne (h_2, i_2)$ and $L(h_1, i_1) = L(h_2, i_2)$, then $f(h_1, i_1) \ne f(h_2, i_2)$,
for every $h_1, h_2 \in N_g$ and every $i_1, i_2 \in N_d$.   (6)

Permutation $\pi$ is routed in two slots. During the first slot, $n$ packets are routed through $n$
of the $g^2$ couplers of the POPS network; precisely, the packet originating at processor
$i + hd$ is sent through coupler $c(f(h, i), h)$, $h \in N_g$, $i \in N_d$. No conflict can occur on any
coupler by equation (4). Moreover, exactly $d$ packets arrive at each group by equation (5);
hence, it is easy to assign a distinct processor to read each of the incoming packets. After
the first slot, the $n$ packets are fairly distributed by equation (6). Consequently, a second
slot is enough to route all of them to their destinations by Fact 1.

Finally, consider the case when $d > g$. Take the list system $(N_g, N_d, L)$, where $L :
N_g \times N_d \to N_g$ is such that $L(h, i) = \mathrm{group}(\pi(i + hd))$, $h \in N_g$, $i \in N_d$. The list system
is proper, since $\pi$ is a permutation, and $d$ clearly divides $gd$. By Theorem 1, $(N_g, N_d, L)$
admits a fair distribution $f : N_g \times N_d \to N_d$. Consequently, $f$ maps every pair $(h, i)$ to an
integer from $N_g$ in such a way that equation (4), equation (6), and the following equation (7)
hold.

$|\{ (h, i) \in N_g \times N_d \mid f(h, i) = j \}| = d$ for every $j \in N_d$.   (7)

Permutation $\pi$ is routed in $\lceil d/g \rceil$ rounds. Each round $k$, $k = 0, \ldots, \lceil d/g \rceil - 1$, consists
of two slots. During the first slot of all rounds but the last one, $g^2$ packets are routed
through the $g^2$ couplers of the POPS network; precisely, the packet originating at
processor $i + kg + hd$ is sent through coupler $c(f(h, i + kg), h)$, $h \in N_g$, $i \in N_g$. No conflict
can occur on any coupler by equation (4). Moreover, exactly $g$ packets arrive at group $h$ by
equation (7); hence, it is easy to assign a distinct processor (among the $g$ which just sent
a packet) to read each of the incoming packets. After the first slot, the $g^2$ packets which
moved are fairly distributed by equation (6). Consequently, a second slot is enough to route
all of them to their destinations by Fact 1. The last round is identical to the previous ones
when $g$ divides $d$. Otherwise, only $g(d \bmod g)$ packets are routed in a similar way. After
$\lceil d/g \rceil$ rounds all packets are correctly routed to their destinations.

The routing is completed after $\lceil d/g \rceil$ rounds, and each round consists of two slots.
Consequently, $\pi$ is routed using one slot when $d = 1$ and $2\lceil d/g \rceil$ slots when $d > 1$, as
claimed. □
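
For the case $1 < d \le g$, the two slots of the proof can be simulated directly once a fair distribution is available. The sketch below takes $f$ as an input table `f[h][i]` (how it is computed is outside this block); the function name, the choice of reading processors in arrival order, and the sample permutation with its hand-derived fair distribution are all assumptions made for illustration.

```python
def route_two_slots(pi, d, g, f):
    """Simulate the two-slot routing for 1 < d <= g, given a fair distribution f[h][i].

    Returns (holds, ok): holds maps each processor to the destination of the packet
    it stores after slot 1; ok is True if slot 2 delivers every packet.
    """
    # Slot 1: the packet at processor i + h*d travels through coupler c(f[h][i], h).
    used = set()
    arrivals = [[] for _ in range(g)]       # destinations of packets reaching each group
    for h in range(g):
        for i in range(d):
            c = (f[h][i], h)
            assert c not in used, "coupler conflict in slot 1"
            used.add(c)
            arrivals[f[h][i]].append(pi[i + h * d])
    holds = {}
    for a in range(g):
        assert len(arrivals[a]) == d, "a group must receive exactly d packets"
        for offset, dest in enumerate(arrivals[a]):
            holds[a * d + offset] = dest    # distinct reading processor per packet
    # Slot 2: every packet goes straight to its destination, as in Fact 1.
    used = set()
    for p, dest in holds.items():
        c = (dest // d, p // d)
        assert c not in used, "coupler conflict in slot 2"
        used.add(c)
    return holds, set(holds.values()) == set(range(d * g))

# POPS(2, 3) with a hand-checked fair distribution for this permutation.
pi = [2, 3, 0, 5, 1, 4]
f = [[0, 1], [0, 2], [2, 1]]
holds, ok = route_two_slots(pi, 2, 3, f)
print(holds)   # {0: 2, 1: 0, 2: 3, 3: 4, 4: 5, 5: 1}
print(ok)      # True
```

After slot 1 every processor again stores exactly one packet, in line with the remark preceding Theorem 2.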

The routing described by the previous theorem can be computed efficiently. The bottleneck
consists in finding a fair distribution for the list system described by $\pi$, as in
Theorem 1 and Remark 1. It is easy to see that this can be done in $O(g^3)$ or $O(g^2 \log g)$ time,
when $1 < d \le g$, and in $O(dn)$ or $O(n \log d)$ time, when $d > g$, by using the algorithms
in [Schrijver 1999] and [Kapoor and Rizzi 2000; Rizzi 2001], respectively.



3.3 Optimality 

Theorem 2 is not far from optimality for almost all permutations. Indeed, if $\pi$ is such
that $\pi(i) \ne i$ for all $i$, then the routing found by Theorem 2 uses at most twice the
optimal number of slots.

Proposition 1. If $\pi$ is such that $\pi(i) \ne i$ for all $i$, then a POPS$(d, g)$ network must
use at least $\lceil d/g \rceil$ slots to route $\pi$.

PROOF. Under the above assumptions, all packet destinations are different from the
source. Hence, at least one slot is needed by each packet to reach the desired destination.
Since a POPS$(d, g)$ network can move at most $g^2$ packets per slot, $\lceil n/g^2 \rceil = \lceil d/g \rceil$ slots
must be used to route all the packets. □
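
The gap between the number of slots used by Theorem 2 and the lower bound of Proposition 1 can be tabulated with a few lines of Python. The function names are illustrative assumptions; the lower bound applies only to permutations with $\pi(i) \ne i$ for all $i$.

```python
def ceil_div(a, b):
    """Ceiling of a / b for positive integers."""
    return -(-a // b)

def slots_used(d, g):
    """Slots used by the routing of Theorem 2."""
    return 1 if d == 1 else 2 * ceil_div(d, g)

def slots_lower_bound(d, g):
    """Proposition 1: n = d*g packets, at most g*g of them moved per slot."""
    return ceil_div(d * g, g * g)

for d, g in [(1, 16), (4, 4), (8, 2), (5, 3)]:
    print((d, g), slots_lower_bound(d, g), slots_used(d, g))
# (1, 16) 1 1
# (4, 4) 1 2
# (8, 2) 4 8
# (5, 3) 2 4
```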

Moreover, there exist permutations for which Theorem 2 is optimal. One example is
vector reversal (when $g$ is even); the proof can be found in [Sahni 2000a]. A straightforward
generalization of the proof in [Sahni 2000a] shows that many other permutations
have the same property.

Proposition 2. If $\pi$ is such that $\mathrm{group}(i) \ne \mathrm{group}(\pi(i))$ and

$\mathrm{group}(i) = \mathrm{group}(j) \Rightarrow \mathrm{group}(\pi(i)) = \mathrm{group}(\pi(j))$

for all $i$ and $j$, then a POPS$(d, g)$ network, $dg = n$, must use at least $2\lceil d/g \rceil$ slots to route
$\pi$.

Finally, also when the assumption that $\mathrm{group}(i) \ne \mathrm{group}(\pi(i))$ is removed, our algorithm
gets very close to an optimal number of slots.

Proposition 3. If $\pi$ is such that $\pi(i) \ne i$ for all $i$ and

$\mathrm{group}(i) = \mathrm{group}(j) \Rightarrow \mathrm{group}(\pi(i)) = \mathrm{group}(\pi(j))$

for all $i$ and $j$, then a POPS$(d, g)$ network, $dg = n$, must use at least $2\lceil d/(1 + g) \rceil$ slots to
route $\pi$.
PROOF. Suppose that a POPS$(d, g)$ network can route $\pi$ in $t$ slots. If $t \ge d$, then it is
easy to see that $t \ge 2\lceil d/(1 + g) \rceil$. Hence, we can assume without loss of generality that
$t < d$.

Since $\mathrm{group}(i) = \mathrm{group}(j) \Rightarrow \mathrm{group}(\pi(i)) = \mathrm{group}(\pi(j))$, at most $t$ packets per group
can be routed to destination in one slot only. All the other packets, at least $d - t$ per group,
have to perform at least 2 hops to get to destination. Taking into account that a POPS$(d, g)$
network can move at most $g^2$ packets per slot, we get $t g^2 \ge g t + 2g(d - t)$, which implies that
$t \ge 2\lceil d/(1 + g) \rceil$. □
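
The algebra behind the counting inequality in the proof of Proposition 3 can be spelled out as follows. This is only a rearrangement of that inequality, dividing by $g$ and collecting the terms in $t$; it stops before any rounding to integers.

```latex
\begin{align*}
t g^2 &\ge g t + 2 g (d - t) \\
t g   &\ge t + 2 (d - t) = 2d - t \\
t (1 + g) &\ge 2d \\
t &\ge \frac{2d}{1 + g}
\end{align*}
```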

4. CONCLUSION 

A few papers have appeared in the recent literature describing how data can be moved efficiently
in a POPS$(d, g)$ network. In particular, several permutation routing problems have been
independently attacked in order to show they are routable in one slot when $d = 1$ and $2\lceil d/g \rceil$
slots when $d > 1$. With Theorem 2, we demonstrate that exactly the same result holds
for any permutation $\pi$, and that the routing for $\pi$ can be efficiently computed. Moreover,
the number of slots used is optimal for a class of permutations, and at most twice the
number of slots required by any permutation $\pi$ such that $\pi(i) \ne i$ for all $i$.